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Abstract. One of the continuing challenges in any Earth science 
investigation is the discovery and access of useful science content 
from the increasingly large volumes of Earth science data and related 
information available online. Current Earth science data systems are 
designed with the assumption that researchers access data primarily by 
instalment or geophysical parameter. Those who know exactly the 
data sets they need can obtain the specific files using these systems. 
However, in cases where researchers are interested in studying an 
event of research interest, they must manually assemble a variety of 
relevant data sets by searching the different distributed data systems. 
Consequently, there is a need to design and build specialized search 
and discovery tools in Earth science that can filter through large 
volumes of distributed online data and information and only aggregate 
the relevant resources needed to support climatology and case studies. 

This paper presents a specialized search and discovery tool that 
automatically creates curated Data Albums. The tool was designed to 
enable key elements of the search process such as dynamic interaction 
and sense-making. The tool supports dynamic interaction via different 
modes of interactivity and visual presentation of information. The 
compilation of information and data into a Data Album is analogous to 
a shoebox within the sense-making framework. This tool automates 
most of the tedious information/data gathering tasks for researchers. 
Data curation by the tool is achieved via an ontology-based, relevancy 
ranking algorithm that filters out non-relevant information and data. 
The curation enables better search results as compared to the simple 
keyword searches provided by existing data systems in Earth science. 

Keywords: data curation, information retrieval, data albums, 
relevancy ranking 



Introduction 


One of the continuing challenges in any Earth science investigation is the discovery and 
access of useful science content from the increasingly large volumes of Earth science data 
and related information available online. Common approaches used in Earth science 
research, such as case study analysis and climatology studies, involve discovering and 
gathering diverse data sets and information to support research goals [1], Research based 
on case studies involves a detailed description of specific weather events using data from 
different sources to characterize physical processes in play for a specific event. 
Climatology-based research tends to focus on the representativeness of a given event by 
studying the characteristics and distribution of a large number of events. This allows 
researchers to generalize event characteristics such as spatio-temporal distribution, intensity, 
annual cycle, and duration. However, gathering relevant data and information for case 
studies and climatology analysis is both tedious and time consuming. 

Current Earth science data systems are designed with the assumption that researchers 
access data primarily by instrument or geophysical parameter. Those who know exactly 
which data sets they need can obtain the desired files using these systems. However, in 
cases where researchers are studying an event of interest, they have to manually assemble a 
variety of relevant data sets by searching different distributed data systems. For such 
scenarios, the search process needs to be organized around the event rather than observing 
instillments. In addition, the existing data systems are based on Boolean information 
retrieval models utilizing keyword matches. Such systems assume users have sufficient 
knowledge regarding the domain vocabulary to be able to effectively utilize the search 
catalogs. These systems lack support for new or interdisciplinary researchers who may be 
unfamiliar with the domain terminology. 

Consequently, there is a need to design and build specialized search and discover tools 
in Earth science that can filter through large volumes of distributed online data and 
information and only aggregate the relevant resources needed to support climatology and 
case studies. One specific scenario describing this need comes from hurricane science. 
Many hurricane researchers are familiar with limited, but specific, data sets, and are often 
unaware of or unfamiliar with a large quantity of other resources. Finding airborne or 
satellite data relevant to a given hurricane often requires a time consuming search through 
web pages and data archives. Background information related to damages, deaths, and 
injuries requires extensive online searches for news reports and official hurricane 
summaries. This search process could be made much more efficient and productive if a 
single system could provide not just links to airborne and satellite web sites but also to 
specific instmment data and satellite granules relevant to the hurricane(s) as well as all 
related reports, summaries, news stories, and images. Not only would such a system 
provide a richer resource of information to the researcher, it would also increase the 
exposure of existing data sets to a community that may be unfamiliar with or hesitant to use 



the data because of difficulties in tracking down the individual satellite overpasses or 
airborne flights relevant to a hurricane. 

This application paper presents a specialized search and discovery tool for Earth 
science to address these needs. The search tool automatically creates curated Data Albums. 
Data Albums are compiled collections of information related to a specific science topic or 
event, containing links to relevant data files (granules) from different instruments as well as 
tools and services for visualization and analysis and information about the event contained 
in news reports, images, or videos to supplement research analysis. Data curation by the 
tool is achieved via an ontology-based relevancy ranking algorithm that filters out non- 
relevant information and data. The search tool is novel in its use of aggregation, an 
ontology-based relevancy ranking algorithm to curate data, and information visualization 
and presentation. The key contributions of this paper are in two areas. First, the tool and its 
interface were designed to make the search process more intuitive and usable by supporting 
dynamic interaction mode and sense-making (as explained in the next section). Second, the 
tool uses an ontology-based relevancy ranking service for data curation, and this service 
provides better search results as compared to existing keyword search systems in Earth 
science. 

The paper is structured as follows. Section 2 summarizes existing research covering the 
models for the search process and the use of semantic technologies in spatial data systems. 
The architecture design and the implementation details of specific components are covered 
in Section 3. The architecture and the algorithm used for the ontology-based relevancy 
ranking service are presented in detail in Section 3.2.3. The search tool is designed to be 
general and reusable. It can be customized for specific applications, and an example of this 
is described in Section 4 where the tool is customized to support Hurricane science. This 
section describes both the application ontology needed for the customization and the design 
of the user interface to visually present information to support Hurricane research. 


1. Background 

1.1 Modeling the Search Process 

Search is defined in the dictionary as trying to find something by looking or otherwise 
seeking carefully and thoroughly. Research on search system design has focused on 
modeling the “information seeking” process. The standard information seeking model used 
as the basis for designing most search tools breaks the process down into four steps. The 
first step involves identifying the problem or task. The next step entails articulating the 
information need. The third step is to formulate the search query, and the final step involves 
evaluating the results from the query. The iteration in this model involves the steps of result 
evaluation and query formulation. The chief drawback of this model is that it assumes that 
the task remains static throughout the process. Bates [2] proposes a Dynamic Model as an 
alternative to the standard model. In this model, the user’s needs evolve during the seeking 
process as he or she interacts with the retrieved information. This mode of information 
seeking is also referred to as “information foraging” [3]. This dynamic model captures the 



user’s “discovery of latent needs” where interaction with retrieved information leads to new 
and unanticipated goals. 

The other component of the search relevant to the design of scientific data systems 
besides the process itself is sense-making. Sense-making is defined as the behavior of 
intelligence analysts and other knowledge workers related to information seeking and use 
[4]. Pirolli and Card [5] define a framework for sense-making for intelligence analysts as a 
set of stages, with each stage utilizing the retrieved information. During the first stage 
(called a shoebox), the analyst gathers all relevant documents into a single collection. In the 
second stage, the contents of the shoebox are curated in order to be filtered further. This 
stage is labeled as the evidence file by Pirolli and Card. The final stage in the sense- 
making framework uses the information gathered in the evidence file to build a schema, a 
mental model of how all this information fits together. 

Most Earth science data system search tools are built based on the standard information 
seeking model with limited or no capability to support the aforementioned “discovery of 
latent needs”. The design of Data Albums supports the dynamic information seeking mode 
and models the sense-making framework. The auto-curated Data Albums serve as a 
shoebox for researchers using this tool. The contents of a Data Album can then be filtered 
depending on a researcher's specific need. 

1.2 Use of Semantic Technologies within Spatial Data Infrastructures (SDIs) 

While this list is by no means exhaustive, existing research focused on augmenting spatial 
data infrastructure with semantic components can be categorized into three distinct classes 
based on their application. These application classes are: semantic mediation and query 
expansion; knowledge discovery (covering search, retrieval, integration, and interpretation); 
and information retrieval. 

1.2.1 Semantic Mediation and Query’ Expansion 

Klien et al. [6] used semantic technologies to address the problem of semantic 
heterogeneity due to synonyms and homonyms for keyword searches in catalogues. These 
authors used an ontology-based discovery service to mediate between the catalogs. In their 
prototype, each catalog has its own application ontology and the mediation is achieved by 
linking these application ontologies to a common shared vocabulary. Klien et al. [6] noted 
the limitations of such an approach, as it works in only well-defined domains where small 
groups commit to shared vocabulary. 

As part of our earlier semantic research work with the Noesis search tool [7], we 
explored the use of semantic web technology for search within Geosciences. The Noesis 
tool was designed as a customizable search engine that uses domain knowledge captured in 
ontologies. The use of domain ontologies allows the richness in the relationships between 
the domain concepts to be captured and used by a reasoner. As advocated by Klien et al., 
[6], Noesis also used mediation ontologies to mediate between the domain ontology terms 
(user view) and the vocabulary terms used in different resource catalogs (application view). 
Noesis provided a simple user interface to allow users to start the search from any point 



within the ontology. If the term entered is defined in the ontology, the reasoner was used to 
infer all the related terms for automated search query expansion. 

1.2.2 Knowledge Discovery > covering search, retrieval, integration and interpretation 

The use of semantic technology to support the knowledge discovery cycle within Spatial 
Data Infrastructures (SDIs) has been actively researched. This knowledge discovery cycle 
covers multiple steps including search, retrieval, data integration, and interpretation. 
Kammersell and Dean [8] described the concept of “Conceptual Search” where the 
semantic layer is integrated within a Service Oriented Architecture. Using a prototype, the 
authors demonstrated the utility of the semantic layer to map user queries into component 
semantic queries which in turn query distributed catalogs and databases. The use case 
implemented by the prototype demonstrated the value of incorporating a semantic layer for 
data discovery and integration to answer complex geospatial queries. 

Lutz and Kolas [9] investigated the use of semantic technology within a spatial data 
infrastructure to answer prototypical geospatial queries. These queries required addressing 
semantic heterogeneity for discovery and semantic interpretation of distributed schemas for 
data integration. Using a prototype, authors demonstrated the use of schema mapping rules 
in building descriptions of the content and schemas of the data sources in SDIs and domain 
rules to infer new unknown facts that can be used to answer the user’s question. 

Lutz et al. [10] explored the use of semantics within a SDI at three different levels: a 
Metadata level focused on discovery of geographic information; a Schema level focused on 
retrieval; and a Data content level that addressed the interpretation, integration, and 
exchange of geographic information. A hybrid ontology architecture in conjunction with a 
DL reasoner was used to make the semantics of the information content of the geospatial 
web services explicit. The hybrid architecture consisted of application ontologies for each 
individual information systems within the SDI and a common shared vocabulary for the 
domain. 

Peng et al. [11] designed a prototype with a semantic middleware to automate the 
generation of new knowledge-added (data) products via workflows. Their prototype 
combines geospatial data and service discovery and process models into automatic service 
compositions including process modeling, process model instantiation, and workflow 
execution. The process models (or “virtual data products”) in the prototype are represented 
via workflow ontologies, and these models link to geoprocessing workflows or service 
chain instances. To answer user queries, these workflows can be executed on the fly to 
generate required data products. 

1.2.3 Information Retrieval 

Semantics have been incorporated into the information retrieval systems, and ranking 
algorithms for both improving the precision and recall of the searches and also for adding 
additional parametric dimensions for similarity measure. For example, Liu et al. [12] 
describes an approach for utilizing semantics for geographical Information retrieval that 



goes beyond concept mediation between a user’s cognitive model and the system and 
extends it to also resolve spatial relationships between objects. 


Li et al. [13] focused on developing a methodology to measure semantic similarity 
between spatial objects by combining Description Logic-based knowledge and a multilayer 
neural network. DL is used to generate features that are then used as input for the neural net. 

Andrade et al.[ 14] describe a framework to improve information retrievals within an 
SDI by extending the traditional IR concepts to include additional dimensions such as space 
and time in addition to themes. Similarity of a geospatial feature to the user's query is 
computed as a weighted sum of spatial, thematic, and temporal relevancy rankings. 

Our Data Albums tool utilizes semantic components within its information retrieval 
system and relevancy ranking algorithm. Our approach is described in detail in section 
2.2.3. 


2. Application Design 

This section describes the Data Albums tool architecture, components, and implementation 
methodology. Details of the semantic component of the tool, including the ontology-based 
relevancy ranking service, are given in section 3.2.3. 

2.1 Functional Architecture Overview 

Data album creation is triggered by a real time or a historical database-driven event feed 
that provides spatiotemporal constraints for a given event occurrence such as a hurricane. 
The search tool needs to be pre-configured with a list of the different online resources it can 
query. Each resource has a search broker. Using this pre-configured resource list, the tool 
performs both data and information searches for relevant coincident observations and 
information. Retrieved search results are aggregated and presented to the user as an 
interactive Data Album containing compiled information regarding relevant data for 
specific events. This data is filtered based on geophysical parameters, geolocation, time, 
and semantic relevance. Users can review the list of data sets and, based on the information 
provided, obtain links to individual data files or data access services. 

An application ontology describing the phenomena of interest is used by the relevancy 
ranking algorithm to provide a curation service to the search tool. The users can modify the 
semantic relevancy threshold to increase or decrease the amount of relevant data sets 
aggregated by the tool. The Data Albums tool also compiles supplementary information to 
augment research, including news articles, reports, images, and videos detailing the event 
and its socio-economic impacts as well as other useful information such as weather reports. 
The Data Albums search tool is designed using a standard thin client web server 
architecture as seen in Figure 1. There are three layers of this architecture: data, service, 
and presentation. The data and service layers reside on the web server, whereas the 



presentation layer is contained within the browsers. Different components within these 
layers are described below. 
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Figure 1: Client Server architecture for the Data Album tool depicting different layers and functional components 


2.2 Data Layer Components 


2.2.1. Aggregation Engine 

As the component responsible for aggregation, the aggregation engine invokes the different 
brokers that interface with the external information and data repositories. The engine can be 
invoked manually by the administrator or run periodically using a cron. The event broker 
provides all the information regarding the events, i.e., location and time. The data broker is 
invoked and executed using coarse grain parallelization, as the data queries for some 
instances can be quite lengthy 1 . The NASA ECHO catalog holds information for over 6000 
data collections and millions of individual data files/granules. These queries are run in the 
background and the tool stores the results locally. 


1 For example, searching [15] for Hurricanes from 1950-2012 takes around 3 months of processing on a 
regular web server. 


2. 2. 2. Brokers 


Data Search Broker - This broker is designed to query the NASA ECHO catalog. The 
broker uses the ECHO REST APIs [16] to get collections and granules. The data is 
retrieved for the entire duration of any given event. The data search broker uses a bounding 
box as a buffer for individual event location points to search for granules in the catalog. In 
order to make the query efficient, data granules from different collections are searched for a 
given time and spatial location. The broker optimizes the search queries using a temporal 
resolution of the data collection to avoid gathering redundant data granules within a Data 
Album. For example, if two points of an event are 6 hours apart, the broker should only get 
the granule whose time resolution covers both the time events once. These rules also make 
the query process more efficient as redundant queries are avoided. The data granule 
metadata are returned in an XML format and ingested into a MySQL database. The data 
search broker can be invoked on the command line and run in parallel on multiple machines 
to speed up the querying process. The broker also logs both download time and failures, 
which is useful for monitoring purposes. 

Event Broker - Event information is generally located in some external database or 
website. A custom broker is needed to gather all the event information from these external 
resources. For example, the Data Albums instance for Hurricanes uses the HURD AT 
database [17]. The HURD AT event broker downloads and then parses the HURD AT track 
information. Each point in a storm track is stored in a MySQL database. All other search 
brokers within the data layer use this event information to construct spatio-temporal queries. 

Rule-Based Parser Broker - This broker is used to parse unstructured documents such 
as text files to extract information from different websites and documents. The broker uses 
regular expressions for parsing. Complex processing rules can be constructed by chaining a 
sequence of regular expressions to match a specific pattern within the unstructured texts. 
These rules include options to skip spaces or ignore a certain number of characters. For 
example, we can use this broker to match the certain text patterns in unstructured storm 
reports to extract storm statistics from the NOAA website. The broker converts the 
extracted information into an internal information model and stores it in a MongoDB, a 
NoSQL database engine [ 1 8] 2 . 

Another application of this broker is its use for parsing existing NOAA reports 
containing assessments of their ability to forecast for the genesis, intensification, and track 
of a hurricane. Extracting this information is important because generally researchers are 
interested in studying hurricanes where the forecasts were incorrect and learning why. 
Specific rules were defined to parse out just the key sentences in these reports. 

Other Brokers - Additional brokers allow mediation between the resource API (such as 
YouTube and Wikipedia) and the search tool. The returned results are mapped back to our 
information model and stored in the MongoDB to be used by the presentation layer 
components. 


2 A NoSQL database is more efficient than a traditional SQL database in storing unstructured information . 



2.2.3. Ontology-Based Relevancy Ranking Service 

There are different ranking models to assess the relevancy of documents. Typical database- 
driven search systems use a Boolean model that only return documents containing the query 
keyword without any ranking. Statistical models such as tf x idf are the most commonly 
used approach to document ranking. Algorithms such as PageRank are examples of 
Hyperlink models. Ontology-based models are a type of Concept model that map the text in 
a document to concepts in an ontology via an annotation process; different algorithms can 
then be used to calculate weights for these annotated terms to measure the relevancy of the 
document. 

In addition to these models, there are hybrid approaches that combine multiple models. 
For example, ORank [19] is an ontology -based system for ranking documents that 
considers the conceptual, statistical, and linguistic features of a document. ORank 
improves precision of the search results over existing statistical model-based approaches by 
using concept instances in the document. Bouramoul et al. [20] used ontologies not only 
for document indexing and query reformulation, but also as an important part of improving 
web search results by modifying information filtering and ranking algorithms. This 
approach also provides a general architecture with functional components for incorporating 
semantics in the search process. Based on their experiments using existing search engines, 
[20] find that incorporating semantics leads to improvements in the search results. 

Our approach for the relevancy ranking service utilizes a hybrid model combining 
statistical and concept models and is influenced by both the ORank system [19] and the 
general architecture described in [20]. Implemented in Java and deployed as a Tomcat Web 
application, this service is designed to be a general-purpose service that can be customized 
and re-used by other applications. The service requires an application ontology, access to 
collections of documents (metadata for a data set) to be ranked, and a set of theme 
keywords. The service then provides a list of ranked documents along with their relevancy 
score. For a given set of theme keywords, the service uses an algorithm that combines both 
ontology-based and traditional statistical scores to estimate the relevancy of a resource. The 
service is broken into five modules: ontology processing, information extraction, 

annotation, statistical analysis, and ranking. The individual components of the service and 
their interaction are depicted in Figure 2. 




Figure 2: Ontology-based relevancy ranking service components and their functional interactions. 


Ontology Processing Module - This module is responsible for inference, calculating 
weights for individual ontological links, and calculating activation values for concepts in 
the ontology. The module utilizes the Protege OWL API with the Hermit OWL Reasoner 
[21] as the inference engine to provide relationships between all the concepts in the 
ontology. The ontology is treated as a graph, and weights are calculated for links between 
two concepts using the following equation: 
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W(Cj, c 0 is the weight between C, and Ck 
,=1 is total number of related concepts to both C, and Ck 



is total number of related concepts to Cj 
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,_1 is total number of related concepts to Ck 

The numerator in the formula for the weight W(c c 0 represents the total number of 
concepts in the ontology which are related to both concepts (Cj and Ck) and therefore 
contribute to their similarity. 

The denominator in the formula scales the value of the similarity with the total number 
of related concepts related to each concept Cj and q. The obtained link weight for each 
concept is stored in a database. 

Once the weights for the links connecting all the concepts in the ontology have been 
calculated, an activation value is calculated for the concepts themselves. A Spread 
Activation algorithm [YY] is used to for this purpose. For certain concepts identified as the 
most pertinent for thematic application, the activation value is set as one. For our 
application to support hurricane science, the activation value for the concept of hurricane in 
the ontology is set as one. The activation values for the remaining concepts within the 
ontology are calculated using the following formula: 

/(c y ) = /(q) x W (cj, q) 

where: 

/ (c) is activation value of current concept 

1(c) is activation value of previous concept 

W (c,-, Cj) is link weight between two concepts 

These activation values are stored in a database and used during the ranking process. 

Information Extraction Parser - This parser receives a set of documents as input and 
extracts the text from each of them. The extracted text is stored in memory for annotation. 

Annotation Module - This module is responsible for searching, parsing, and annotating 
the documents based on the concepts from the application ontology. The module queries 
the database for stored concepts and relationships. The query returns all the terms related to 
the queried keyword. Each document is searched for all the concepts from the application 
ontology. The matching concepts from the document are saved in memory for statistical 
analysis. 

Statistical Analysis Module - This module uses the statistic tf-idf formulation for each 
matched keyword in a document to estimate its weight. Term Frequency (tf) measures the 
frequency of each concept word (c) in each document (d). 

\d\ 

where: 

tf (c, d) is frequency of concept word (c) in document (d) 

\d\ is count of words in document (d) 



Inverse Document Frequency (idf) measures the importance of each concept word (c) in the 
entire document collection (D): 

idf (c, D) = log( I " 1 ) 

df c 

where: 

idf (c, D) is inverse document frequency of (c) in corpus(D) 

\D\ is count of the corpus (D) 

df c is document frequency of (c): the number of document that contain (c) 

The tf-idf for a matching keyword is calculated as below: 
tf idf (c, d, D) =tf(c,d)x idf (c, D ) 
where: 

tf.idf ( c,d,D ) is the tf.idf for (c) in (d) that is in corpus ( D ) 

The tf.idf value is saved in memory and used in the ranking calculation. 

Ranking Module - This module calculates a relevancy score for each document. The 
formula to estimate the score (S,j) for a document (d) in the collection of documents ( D ) is 
as follows: 

m 

S d=Y, 7 ( c i) x tf-idf (c.,d, D) 

i = 1 

where: 

Sd is final score of document (d). and 

m 

5 ] 1 ( c t) x tf.idfie^d , D) 

i = 1 

is the sum of the score of each matched concept and is calculated by multiplying the 
concept and statistical scores. 

S d for a given document (d) is the sum of the scores of each matched concept/keyword 
and is obtained by multiplying the activation values for the concept and the statistical score. 

2.3 Service Layer 

RESTful Data API Module - This module provides an API which supplies faceted search 
capabilities in the client. The API interfaces with both the MongoDB and MySQL in the 
data layer and enables queries to their holdings. This component also caches requests to 
enable efficiency and sends output responses in JSON. The JSON responses are 
compressed using gzip compression to enable faster transfers. 



Analytics Module - This module uses the R [22] statistical analysis package to generate 
graphs and charts. R is used in conjunction with the ShinyR and Node.js to provide the UI 
for the analytics module for one of the applications. 

2.4 Presentation Layer 

The presentation layer uses open source libraries such jQuery, jQuery UI, D3.js, 
OpenLayers, jqGrid, and Lightbox to create the interactive user interface and present the 
information visually. jQuery provides cross browser common utility functions. jQuery UI 
provides popup boxes, theme templates, drop down menus, sliders, and other UI elements. 
D3.js is a JavaScript-based library for manipulating documents based on data [23]. D3.js 
provides powerful interactive visualization components using a data-driven approach to 
DOM manipulation while making full use of the capabilities provided by modern web 
browsers. Sunburst, Bubble chart, and Treemap visualizations of information are built using 
D3.js library. OpenLayers is used to display storm track information as well as to display 
MODIS imagery. jqGrid [24] is used to list the available data granules for the current 
selection of data sets in a paged grid view UI component. Lightbox provides a slideshow 
plugin for viewing images and playing YouTube videos. The presentation layer accesses 
REST APIs provided by the service layer to fetch stored information and provide faceted 
search using AJAX. It is possible to build entirely new presentation layers using these 
REST APIs for web, desktop, or mobile clients. 

It is important to note that the Data Albums architecture by design is modular to allow 
reuse of components as different portal instances are created for different science 
applications. Every new instance requires its own application ontology and an event broker. 
The presentation layer requires minor modifications to be customized for the specific 
science application needs. Other brokers and infrastructure components can be reused as is. 
An example science application is described in the following section. 


3. Science Application: Catalog of Hurricane case studies 

The study of an individual hurricane or a set of hurricanes requires information and 
scientific data related to the storm(s). We have customized the Data Albums tool to create a 
Hurricane Case Study portal. This portal compiles information and links to data granules 
from multiple instmments, organized by specific hurricane events. The portal is unique in 
that it supplements data set information with factual content such as news, weather reports, 
images, and videos, all of which are potentially extremely useful for case study analysis. 
The Hurricane Case Study portal uses the database of storm tracks for Atlantic and Eastern 
Pacific cyclones provided by the National Hurricane Center (NHC) to trigger the 
aggregation and curation of data. The portal, which is accessible via a variety of interfaces, 
provides a collection of data subsets and imagery of satellite observations of these 
hurricanes as well as data collections from NASA hurricane field campaigns including 
airborne observations, radiosondes, radar, and mission reports. 



The presentation layer for the Hurricane portal is designed to provide an easy visual 
overview of all the event holdings in the catalog. The presentation layer also searches for 
case studies based on different feature dimensions such as storm intensity and duration and 
then enables a researcher to drill down to a Data Album view for a specific hurricane event. 
The Data Album view allows researchers to quickly assess the richness of the aggregated 
information and access data files needed for their investigation. 

3. 1 Hurricane Ontology’ 

An application ontology is needed to customize the relevancy ranking service. 
Consequently, a hurricane ontology was designed with the help of hurricane researchers to 
capture key concepts such as when and where hurricanes occur, the different stages of 
hurricane life cycles and the physical parameters used to characterize them, the 
measurements used to study hurricanes, etc. The key concepts and relationships within this 
ontology are described briefly below. 

Phenomena is used as the base class for all atmospheric phenomena associated with 
hurricanes. Phenomena contain several subclasses: Cyclone, Flood, LandFall, Lightning, 
Rain, StormSurge, Tornado, TropicalDisturbance, and WindGust. 

The Cyclone class defines phenomena characterized by a rotating flow of large scale 
air masses; Hurricane is one such a phenomenon, and hence a subclass of Cyclone. The 
Cyclone class includes other three relevant subclasses: Typhoon, a similar phenomenon that 
occurs in West Pacific Ocean, TropicalDepression, and TropicalStorm, the latter two of 
which are less developed cyclone systems. 

The Parameter base class is defined as a class containing all atmospheric parameters 
used to characterize a hurricane. Some of the subclasses of Parameter include Pressure, 
TemperatureProfile, RainRate, Reflectivity, WindSpeed, Velocity, StormSurgeHeight, and 
SurfaceTemperature. 

Component is the base class used to define all the hurricane components including 
CirroStratus, Cirrus, CloudBand, Eye, CumuloNimbusCloud, EyeWall, LowPressure, and 
Thundercloud. 

All the environmental conditions needed for hurricane formation are defined under the 
Condition base class. Example subclasses include Coriolis, representing coriolis force on 
the rotation of a hurricane; WarmOceanSurfaceWater, which is the warm ocean surface 
water needed for the formation of hurricanes; and CyclonicRotation, representing the 
cyclonic nature of hurricane morphology. 

The spatial and temporal information of hurricanes is captured using the Location and 
Occurrence base classes. Example subclasses of Location include TropicallndianOcean, 
TropicalNorthAtlantic, and TropicalWesternNorthPacific. 

Other important concepts covered in the ontology include Casulty’AndDamage, 
IntensityScale, DataSourceType and DataSourceOrigin. 

A number of object properties which relate these classes to each other are also defined. 
The hasParameter property, with its domain as the DataSet class and its range as the 
Parameter class, can specify the parameters that a data set contains. The hasComponent 
property relates Component to Cyclone. The hasLocation property has Phenomenon as its 



domain and Location as its range, specifying the location of phenomena. The 
hasDataSourceOrigin and hasDataSourceType properties are used to relate the data set 
source's origin and type, respectively. The hasIntensityScale property relates the intensity 
scale to a cyclone. 

3.2 Visual Overview Interface 

The main objective for the portal landing page (Figure 3) is to allow researchers to browse 
and compare different events suitable for a case study analysis or allow them to select a 
specific case based on year, category, and name. This view consists of the Analytics and 
Visual faceted search panels. 



Figure 3: The main page for the Hurricane portal built using the Data Albums tool. The main page provides 
researchers a high level view of all the events and allows them to interact with and analyze the curated catalog 
holdings. 


3.2.1 Analytics Panel 

The Analytics panel provides users with tools to interactively and visually analyze some of 
the HURD AT parameters. This panel has three tabs: Interactive, Box Plot, and Track Data. 

The Interactive tab displays a scatter plot which can be customized to visualize 
different parameters from HURDAT data - average pressure, average wind speed, 




maximum pressure, maximum wind speed, minimum pressure, and storm length. These 
parameters can be displayed in either linear or log mode by changing the axes setting of the 
scatterplot. Four different variables can be displayed on the scatter plot at one time. In 
addition to the two parameters selected as the x and y axes, two additional parameters can 
be assigned to color and size. In addition, the scatter plot provides a time slider which 
allows the user to visually analyze different parameters for different years. 

The Box Plot tab provides a widget to plot box plots. Box plots are commonly used to 
graphically view groups of data based on quartiles and are a convenient way to identify 
outliers in the data. Users can create box plots for pressure, maximum sustained wind speed, 
and duration parameters. Researchers could identify possible hurricane events for case 
study analysis by using these outliers. Users can also download the data regarding 
hurricanes, tropical storms, and tropical depression tracks and intensities from the Track 
Data tab. 

3.2.2 Visual Faceted Search Panel 

While the previous panel focuses on providing users with visual analytics capabilities 
regarding the storm track data, the Visual faceted search panel allows users to search the 
catalog and drill down to the Data Album for a specific storm. The panel also provides 
three different tabs for searching: the Sunburst, Bubble Chart, and Classic tabs. 

The Sunburst tab allows a user looking for a specific storm to specify the year, 
category, and storm name from the pull down list. It allows the user to graphically explore 
the storms in the catalog. For example, after selecting specific years of interest, the user can 
drill down to a specific year. The Hurricanes are displayed as a wheel, arranged counter 
clockwise and color-coded based on their Hurricane intensity category. 

The duration of these storms is represented in the angle of the wedge in the display. 
This interface allows a researcher to interactively browse through different storms and then 
find an event of interest based on category or duration suitable for their case study. 

The Bubble Chart tab provides a different visualization of the search metadata stored 
in the catalog. Users can select a specific year; the storms associated with that year are 
presented as bubbles, the color of which represents the storm category and the size of which 
represents the total duration of the storm. 

The Visual faceted search panel also provides the listing of the storms in a traditional 
tabular form in the Classic tab. All the storms for a given year are listed as a column, sorted 
alphabetically and color coded based on the storm category. This view is specifically 
designed to meet the needs of domain experts. 

Once a user has narrowed the search down to a specific storm, double clicking on the 
storm name from any of the three panels launches the associated Data Album view. 

3.3 Data Album View 

The Data Album view (see Figure 4) allows the researcher to drill down for a specific 
hurricane event and presents all the aggregated information stored in the catalogs for the 
specific storm. The Data Album page has four main panels: Storm Statistics and 
Multimedia, Storm Track, Collections based on Keywords or Instruments, and Granules. 
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Figure 4: The Data Album view provides the researchers with the deep dive functionality they need 
to look at all the aggregated information and curated data gathered by the tool for a specific hurricane 
event, assess data richness, and drill down to the individual files needed to support their study. 


3.3.1 Storm Statistics and Multimedia 

This panel contains a description of the storm in question gathered from Wikipedia. It also 
has a storm statistics table generated from parsing NOAA storm reports. The table contains 
useful information such as cyclogenesis and cyclosis date, highest Saffir Simpson category, 
maximum wind speed, minimum central pressure, damage, and errors found in genesis, 
track forecast, and intensity data. Youtube videos related to storm forecasts and damage 
reports are also presented here. 


3.3.2. Storm Track 

An interactive map with the storm track is displayed to provide spatial reference. Each 
point on the track provides the date and time of the storm at that location along with its 
intensity information. The map also provides two options: a basic map with a standard 
political map as the background, or one containing MODIS imagery as the background. The 
MODIS imagery is accessed in real-time using NASA’s GIBS service API [25]. 




3.3.3 Data Collections based on Keywords or Instruments 

Similar to the Visual faceted search panel on the main portal page, this panel presents the 
relevant data collections to the end user in any of the three views: Sunburst, Tree Map, or 
Classic. The user can refine his or her search process by selecting specific keywords or 
instillments. Selecting a keyword allows the user to quickly refine the search query and see 
only the curated data. 

In the Sunburst view, the angle in a wedge for a keyword/instrument/data collection 
visually represents the number of data granules/files. This allows researchers to visually 
identify the information richness of different data collections. The TreeMap is generally 
used in information visualization to show disk usage and which directories contain the most 
files. It has been customized here to show the files/granules available based on keywords, 
instillments, or collections. The Classic view displays the search results in a tabular form. 

The slider on the top right corner of the panel can be used to adjust the relevancy 
threshold. If a researcher wants to see more data collections, he can lower the relevancy 
threshold value. Similarly, increasing the threshold value filters and limits data collections 
presented in the album. 

3.3.4 Granules 

The results displayed on this panel are determined based on the selections made on the 
previous panel. This panel lists the all the granules or the data file names in a table along 
with relevant data center and URL links. If a user selects a specific keyword or an 
instillment in the previous panel, this list automatically gets filtered. Researchers can now 
select specific granules or all granules given in the list. Clicking on the button at the bottom 
of the panel generates a list of links to these granules. This list can be copied and is 
typically used by researchers writing their own scripts to access and analyze the data. 


4. Discussion 

We evaluated the value of using an ontology-based relevancy ranking service to provide 
data curation by comparing results against an existing search system [15] that supports only 
keyword-based searches. First we narrowed the data set listings in NASA’s ECHO catalog 
to focus on only one of the twelve data centers, namely Global Hydrology and Resource 
Center (GHRC). Using our domain knowledge and expertise, we manually identified 35 out 
of 160 GHRC data sets as the most relevant for studying hurricanes. The occurrence of the 
keyword ‘hurricane’ in a data set's metadata and the importance of the measured 
atmospheric parameters to hurricanes were some of the criteria used in selecting these data 
sets. These 35 data sets serve as our truth data for evaluating the performance of the ranking 
service. 

Searching the NASA ECHO catalog using ‘hurricane’ as the keyword only returned 6 
GHRC data sets that were in our list of truth data, with a recall of only ~ 0.17. The 
evaluation result for the ranking service depends on the relevancy threshold. The relevancy 
threshold can be set low for lower precision and higher recall, or set high for higher 



precision and lower recall. For example, setting the relevancy threshold below 0.00099 
returns all 160 GHRC data set with a precision of 0.21 (35/160) and recall of 1. The best 
combination of precision and recall is reached at relevancy threshold of 0.033. For this 
threshold, total of 22 G1TRC data sets are returned and 19 of them match with the truth data 
list. Flence, the service provides a precision of 0.863636 and recall is 0.542857. The 
relevancy threshold of 0.033 is set as the default value in the tool. 

While these evaluation results show a marked improvement in search results using an 
ontology-based relevancy ranking service over existing traditional keyword search, there 
are still issues and challenges that need to be addressed. First, the existing relevancy 
ranking algorithm can be enhanced to improve its performance. Specifically, the existing 
algorithm uses the ontology primarily as a graph to calculate the concept weights and does 
not fully utilize the meaning of relationships between concepts described in the ontology to 
assign weights. Second, the performance of the relevancy ranking algorithm is dependent 
on the richness of the ontology. Constructing the ontology without assessing the metadata 
descriptions of the data sets can lead to poor performance. Finally, one of the major 
challenges of utilizing ontologies is the inability to scale them to cover large domains. 
Constructing ontologies is still a tedious manual process. For ontologies to be used in 
operational tools in Earth science, new methods or approaches are needed to automatically 
create ontologies from the existing corpus of documents. Until then, ontologies will be used 
mostly in prototypes for demonstrations or in operational tools addressing niche search 
problems. 


5. Summary and Future Work 

This paper presents a specialized search and discovery tool that automatically creates 
curated Data Albums. The design of the tool was tailored to support the key elements 
within a search process, specifically enabling dynamic interaction and sense-making. The 
tool supports dynamic interaction with, and visual presentation of, information. The 
compilation of information and data into a Data Album is analogous to a shoebox within 
the sense-making framework. This tool automates most of the tedious information/data 
gathering tasks for researchers. Data curation by the tool is achieved via an ontology-based 
relevancy ranking algorithm that filters out non-relevant information and data. This 
curation enables better search results as compared to the simple keyword search provided 
by existing data systems in Earth science. 

As part of our ongoing research and development, we have started building another 
instance of the Data Albums tool focusing on severe weather with a focus on selecting 
appropriate case studies for researchers to evaluate their numerical models. This work is a 
collaboration with NASA’s SPoRT Center which conducts research on unique NASA 
products and capabilities that can be transitioned to the operational community to solve 
forecast problems in the 0-48 hour timeframe. SPoRT routinely runs a mesoscale 
configuration of the WRF model that uses NASA data sets in near real-time to evaluate the 
impact of NASA’s satellite data in improving convective model forecasts. The severe 



storm instance of Data Albums will support SPoRT researchers in their model evaluation 
studies. 
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